Adult Website Classifier

Author

  • Saikat Sen
Abstract

The goal of this project was to detect adult websites and pages that are not safe for kids. We use five different techniques. We create an adult vocabulary and use a composite classifier composed of multiple Naïve Bayes classifiers to classify pages based on URL, title, keywords and content. We use hue, saturation and histogram of gradients features to train random forests, several boosting classifiers and a multilayer perceptron (MLP) on boxed images for local image classification. The intent is to then use a Viola-Jones or Haar approach to classify images globally. We show an edge-detection technique that works better than Canny’s for some images. We show an extension to Markov chains that can help detect edges. The intent is to use classifiers such as SVMs with Gaussian kernels to exploit edge information in detecting body parts. Lastly, we propose AdultRank, a ranking metric that serves as an indicator of the adultness of a page. Together, these techniques can be used effectively to detect adult websites and pages. The only overlap between this work and previous related work is in image recognition using the features listed above and in edge detection techniques.

Introduction

Website classification is an old problem. Internet Explorer labels websites as phishing and malware. Google leaves malware sites out of its search results. The goal of this project was to build a classifier that can label websites and web pages as adult, i.e. unsafe for kids. This classification has many applications. Parents don’t want their kids exposed to adult content. Some adults find porn images offensive. Some governments ban porn sites and have an ongoing requirement to detect them. Many porn sites carry a malware payload and install rootkits, adware, spyware and viruses, so guarding against them is an additional safety measure.

In general, sites and pages could be classified as adult based on many factors, such as adult images, sexual content, violent content, racist sentiments, extreme radical views, etc. The scope of this project is limited to the first two categories. We try to classify pages based on metadata and try to find good classifiers for adult images. Video classification was out of scope for this project but can be done by analyzing individual frames. We also extend the Markov model, propose a ranking metric, AdultRank, and propose a new convolution filter for edge detection.

Strategy

We employ three main techniques: image analysis, text analysis and ranking. For text analysis, we inspect the page title, keywords, URL and content. For image analysis, we use different image recognition techniques. The OpenCV package was used for image recognition and ML classifiers. For ranking, we propose AdultRank, a ranking metric similar to PageRank.

1 URL Text Classification

1.1 URL Features

The following metadata of web pages was used for adult classification:

  • Meta tag “rating”: there are some standards that sites can use to indicate adult content, but none that we saw uses the “rating” meta tag. If the rating tag is found to be adult or restricted, the page is classified as adult. The code was not included in the final toolset, since this analysis can be done independently without machine learning techniques.
  • Title: if the page title contains an adult word, the page is classified as adult.
  • URL: if the URL contains an adult word, the page is classified as adult. The chosen implementation is naïve: it checks whether words from an adult dictionary occur in the URL as substrings. A proper implementation would parse the URL into words with an optimum match, take the site content into consideration and then do a dictionary lookup. As an example, we consider “tit” to be an adult word, but this produces false positives for sites with “title” in the URL. As an example of why site content needs to be considered, “google” can be parsed as one word (read as the mathematical number googol) or as “go” followed by “ogle”; the site content would help determine which of the two parsings is more appropriate.
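The difference between the naïve substring check and a parse-then-lookup check can be sketched as follows. This is a minimal illustration and not the project's code: the two vocabularies are tiny hypothetical stand-ins, the function names are invented for this example, and "optimum match" is assumed here to mean the segmentation with the fewest dictionary words. Site content is not consulted at all.

```python
import re

# Tiny illustrative vocabularies (hypothetical; a real system would load
# a full adult dictionary and a general word list).
ADULT_WORDS = {"tit", "porn"}
COMMON_WORDS = {"title", "page", "go", "ogle", "google", "example", "com", "www"}


def naive_url_is_adult(url):
    """Substring check as described in the text: any adult word anywhere."""
    lowered = url.lower()
    return any(word in lowered for word in ADULT_WORDS)


def segment(token, vocabulary):
    """Split token into the fewest vocabulary words, or None if impossible.

    Dynamic programming over prefixes: best[i] holds the shortest word
    list that exactly covers token[:i].
    """
    best = [None] * (len(token) + 1)
    best[0] = []
    for end in range(1, len(token) + 1):
        for start in range(end):
            piece = token[start:end]
            if best[start] is not None and piece in vocabulary:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(token)]


def segmented_url_is_adult(url):
    """Parse each alphabetic run of the URL into words, then look them up."""
    vocabulary = ADULT_WORDS | COMMON_WORDS
    for token in re.findall(r"[a-z]+", url.lower()):
        # Tokens that cannot be segmented are kept whole.
        words = segment(token, vocabulary) or [token]
        if any(word in ADULT_WORDS for word in words):
            return True
    return False
```

With these vocabularies, the naïve check flags a URL containing "title" while the segmented check does not, and both flag a URL such as "pornpage.example.com". The fewest-words rule always parses "google" as one word; choosing between that and "go" + "ogle" based on site content, as the text suggests, would need an additional scoring step that this sketch omits.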

Similar resources

Link Graph Analysis for Adult Images Classification

ABSTRACT In order to protect an image search engine’s users from undesirable results adult images’ classifier should be built. The information about links from websites to images is employed to create such a classifier. These links are represented as a bipartite website-image graph. Each vertex is equipped with scores of adultness and decentness. The scores for image vertexes are initialized wi...

An Ideal Approach for Detection of Phishing Attacks using Naïve Bayes Classifier

Phishing attack is an aberrant trick to peculate user’s private information by duping them to assail via a spurious website planned to mimic and resembles as an authentic website. The user’s confidential information such as username, password, and PIN number will be grabbed by the attacker and creates a fraudulent transactions. The information holder’s credentials as well as money will be seize...

System Design, Investigation and Countermeasure of Phishing Attacks using Data Mining Classification Methods and its Analysis

The phishing is a kind of e-commerce lure which is intended to steal the confidential information of the internet user by making identical website of legitimate one in which the contents and images most likely remains similar to the legitimate website. The other way of phishing website is to do minor changes in the URL or in the domain of the website. In this paper, an anti-phishing system is p...

Optimizing Precision for Open-World Website Fingerprinting

Traffic analysis attacks to identify which web page a client is browsing, using only her packet metadata — known as website fingerprinting — has been proven effective in closed-world experiments against privacy technologies like Tor. However, due to the base rate fallacy, these attacks have failed in large open-world settings against clients that visit sensitive pages with a low base rate. We f...

A Novel Type-2 Adaptive Neuro Fuzzy Inference System Classifier for Modelling Uncertainty in Prediction of Air Pollution Disaster (RESEARCH NOTE)

Type-2 fuzzy set theory is one of the most powerful tools for dealing with the uncertainty and imperfection in dynamic and complex environments. The applications of type-2 fuzzy sets and soft computing methods are rapidly emerging in the ecological fields such as air pollution and weather prediction. The air pollution problem is a major public health problem in many cities of the world. Predict...

Publication date: 2010